This notebook presents an analysis of data on 17007 strategy games available on the Apple App Store, such as Clash of Clans, Plants vs Zombies, Pokemon GO and others. The dataset was acquired from Kaggle.com, and it was collected on the 3rd of August 2019 using the iTunes API.
With this dataset, we may be able to analyze which factors make a successful game.
To start this analysis, we first load the required packages (tidyverse, readr, DT) and read the csv file provided by Kaggle.
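# Install each package only if it is missing, then load it.
if(!require(tidyverse)){install.packages("tidyverse"); library(tidyverse)}
if(!require(readr)){install.packages("readr"); library(readr)}
if(!require(DT)){install.packages("DT"); library(DT)}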
options(scipen=10000)
appstoreGamesFile = "data/appstore_games.csv"
appstoreGamesDF = read_csv(appstoreGamesFile) %>% rename_all(~str_replace_all(., "\\s+", ""))
summary(appstoreGamesDF)
## URL ID Name
## Length:17007 Min. : 284921427 Length:17007
## Class :character 1st Qu.: 899654330 Class :character
## Mode :character Median :1112286228 Mode :character
## Mean :1059613815
## 3rd Qu.:1286982837
## Max. :1475076711
##
## Subtitle IconURL AverageUserRating UserRatingCount
## Length:17007 Length:17007 Min. :1.000 Min. : 5
## Class :character Class :character 1st Qu.:3.500 1st Qu.: 12
## Mode :character Mode :character Median :4.500 Median : 46
## Mean :4.061 Mean : 3306
## 3rd Qu.:4.500 3rd Qu.: 309
## Max. :5.000 Max. :3032734
## NA's :9446 NA's :9446
## Price In-appPurchases Description
## Min. : 0.0000 Length:17007 Length:17007
## 1st Qu.: 0.0000 Class :character Class :character
## Median : 0.0000 Mode :character Mode :character
## Mean : 0.8134
## 3rd Qu.: 0.0000
## Max. :179.9900
## NA's :24
## Developer AgeRating Languages
## Length:17007 Length:17007 Length:17007
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Size PrimaryGenre Genres
## Min. : 51328 Length:17007 Length:17007
## 1st Qu.: 22950144 Class :character Class :character
## Median : 56768954 Mode :character Mode :character
## Mean : 115706430
## 3rd Qu.: 133027072
## Max. :4005591040
## NA's :1
## OriginalReleaseDate CurrentVersionReleaseDate
## Length:17007 Length:17007
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
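As the summary shows, there are 18 columns in this dataset: URL, ID, Name, Subtitle, IconURL, AverageUserRating, UserRatingCount, Price, In-appPurchases, Description, Developer, AgeRating, Languages, Size, PrimaryGenre, Genres, OriginalReleaseDate and CurrentVersionReleaseDate.
We need to fix the types of some columns: the release dates were read as character strings and should be dates, and the age rating should be a factor with ordered levels.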
fixedAppstoreGamesDF <- appstoreGamesDF %>%
mutate(OriginalReleaseDate = as.Date(OriginalReleaseDate, "%d/%m/%Y")) %>%
mutate(CurrentVersionReleaseDate = as.Date(CurrentVersionReleaseDate, "%d/%m/%Y")) %>%
mutate(AgeRating = factor(AgeRating, levels=c('4+','9+', '12+', '17+')))
appstoreGamesDF <- fixedAppstoreGamesDF
datatable(appstoreGamesDF %>% select(-URL, -ID, -Subtitle, -IconURL, -Description))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
summary(appstoreGamesDF)
## URL ID Name
## Length:17007 Min. : 284921427 Length:17007
## Class :character 1st Qu.: 899654330 Class :character
## Mode :character Median :1112286228 Mode :character
## Mean :1059613815
## 3rd Qu.:1286982837
## Max. :1475076711
##
## Subtitle IconURL AverageUserRating UserRatingCount
## Length:17007 Length:17007 Min. :1.000 Min. : 5
## Class :character Class :character 1st Qu.:3.500 1st Qu.: 12
## Mode :character Mode :character Median :4.500 Median : 46
## Mean :4.061 Mean : 3306
## 3rd Qu.:4.500 3rd Qu.: 309
## Max. :5.000 Max. :3032734
## NA's :9446 NA's :9446
## Price In-appPurchases Description
## Min. : 0.0000 Length:17007 Length:17007
## 1st Qu.: 0.0000 Class :character Class :character
## Median : 0.0000 Mode :character Mode :character
## Mean : 0.8134
## 3rd Qu.: 0.0000
## Max. :179.9900
## NA's :24
## Developer AgeRating Languages Size
## Length:17007 4+ :11806 Length:17007 Min. : 51328
## Class :character 9+ : 2481 Class :character 1st Qu.: 22950144
## Mode :character 12+: 2055 Mode :character Median : 56768954
## 17+: 665 Mean : 115706430
## 3rd Qu.: 133027072
## Max. :4005591040
## NA's :1
## PrimaryGenre Genres OriginalReleaseDate
## Length:17007 Length:17007 Min. :2008-07-11
## Class :character Class :character 1st Qu.:2014-09-23
## Mode :character Mode :character Median :2016-07-09
## Mean :2016-03-04
## 3rd Qu.:2017-12-07
## Max. :2019-10-26
##
## CurrentVersionReleaseDate
## Min. :2008-08-01
## 1st Qu.:2016-04-17
## Median :2017-07-24
## Mean :2017-04-26
## 3rd Qu.:2018-11-19
## Max. :2019-10-26
##
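To make the type conversion above more concrete, here is a minimal sketch on a toy data frame. The two rows are invented for illustration; only the column names and the `"%d/%m/%Y"` format come from the code above.

```r
# Toy rows; the values are invented for illustration only.
toy <- data.frame(
  OriginalReleaseDate = c("11/07/2008", "26/10/2019"),
  AgeRating = c("9+", "4+"),
  stringsAsFactors = FALSE
)

# Day/month/year strings become Date objects.
toy$OriginalReleaseDate <- as.Date(toy$OriginalReleaseDate, format = "%d/%m/%Y")

# Explicit levels keep the age ratings in their natural order
# instead of the alphabetical order a plain factor() would use.
toy$AgeRating <- factor(toy$AgeRating, levels = c("4+", "9+", "12+", "17+"))

class(toy$OriginalReleaseDate)
levels(toy$AgeRating)
```

Setting the levels explicitly is what makes the later plots show 4+, 9+, 12+, 17+ from left to right.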
Right now I have no hypotheses to check, but let's create some plots to see the current state of the games released on the App Store.
First, the number of games released each year. The plot shows that the number of games released per year increased until 2016 and then dropped in 2017 and 2018. The data for 2019 does not cover the full year, so its count may still catch up to the previous years.
appstoreGamesDF %>%
select(OriginalReleaseDate) %>%
mutate(OriginalReleaseYear = format(OriginalReleaseDate, "%Y")) %>%
group_by(OriginalReleaseYear) %>%
summarise(count = n()) %>%
ggplot(aes(x=OriginalReleaseYear, y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25) +
ylab("Number of games released") +
xlab("Year of Release") +
theme_minimal()
appstoreGamesDF %>%
select(CurrentVersionReleaseDate) %>%
mutate(CurrentVersionRelease = format(CurrentVersionReleaseDate, "%Y")) %>%
group_by(CurrentVersionRelease) %>%
summarise(count = n()) %>%
ggplot(aes(x=CurrentVersionRelease, y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal()
unique(appstoreGamesDF$AverageUserRating)
## [1] 4.0 3.5 3.0 2.5 NA 2.0 4.5 1.5 5.0 1.0
appstoreGamesDF %>%
select(AverageUserRating) %>%
filter(!is.na(AverageUserRating)) %>%
group_by(AverageUserRating) %>%
summarise(count = n()) %>%
ggplot(aes(x=AverageUserRating, y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25) +
scale_x_continuous(breaks = seq(1,5,by=0.5)) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal()
appstoreGamesDF %>%
select(AgeRating) %>%
arrange(AgeRating) %>%
group_by(AgeRating) %>%
summarise(count = n()) %>%
ggplot(aes(x=AgeRating, y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25) +
theme_minimal()
appstoreGamesDF %>%
select(ID, Languages) %>%
separate_rows(Languages, sep=",\\s*") %>% # also drop the space after each comma so " ZH" and "ZH" are counted together
drop_na(Languages) %>%
group_by(Languages) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(20) %>%
ggplot(aes(x=reorder(Languages,desc(count)), y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25, size=3.5) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal()
## Selecting by count
appstoreGamesDF %>%
select(ID, Genres) %>%
separate_rows(Genres, sep=",\\s*") %>% # also drop the space after each comma so " Puzzle" and "Puzzle" are counted together
drop_na(Genres) %>%
group_by(Genres) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(20) %>%
ggplot(aes(x=reorder(Genres,desc(count)), y=count)) +
geom_col() +
geom_text(aes(label=count), vjust=-0.25, size=3.5) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal() +
theme(axis.text.x = element_text(angle=90,vjust= 0.2,hjust=1))
## Selecting by count
appstoreGamesDF %>%
select(Size) %>%
filter(!is.na(Size))%>%
arrange(Size) %>%
ggplot(aes(x=Size)) +
geom_histogram(bins=30) +
theme_minimal()
appstoreGamesDF %>%
select(UserRatingCount) %>%
filter(!is.na(UserRatingCount))%>%
filter(UserRatingCount>=10000)%>%
arrange(UserRatingCount) %>%
ggplot(aes(x=UserRatingCount)) +
geom_histogram(bins=10) +
theme_minimal()
appstoreGamesDF %>%
select(Price) %>%
filter(!is.na(Price)) %>%
#(Price>00) %>%
ggplot(aes(x=Price))+
geom_histogram(bins=100)+
theme_minimal()
I removed “Games” and “Entertainment” because they are not game genres. I also removed “Strategy” because every game in this dataset belongs to that genre.
appstoreGamesDF %>%
select(ID,AgeRating, Genres) %>%
separate_rows(Genres, sep=",", convert = TRUE) %>%
mutate(Genres = trimws(Genres)) %>%
filter(Genres != "Strategy" & Genres != "Games" & Genres != "Entertainment") %>%
drop_na(Genres) %>%
group_by(AgeRating,Genres) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(n=5) %>%
#summarise(averageNumberOfLanguages = mean(numberOfLanguages)) %>%
ggplot(aes(x=AgeRating, y=count, fill=Genres)) +
geom_col(position = position_dodge(), width=0.9) +
#geom_text(aes(label=averageNumberOfLanguages), vjust=-0.25, size=3.5) +
scale_fill_brewer(palette="Set1") +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal()
## Selecting by count
appstoreGamesDF %>%
select(ID,AgeRating, Languages) %>%
separate_rows(Languages, sep=",", convert = TRUE) %>%
mutate(Languages = trimws(Languages)) %>%
# filter(Genres != "Strategy" & Genres != "Games" & Genres != "Entertainment") %>%
drop_na(Languages) %>%
group_by(AgeRating,Languages) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(n=7) %>%
ggplot(aes(x=AgeRating, y=count, fill=Languages)) +
geom_col(position = position_dodge(), width=0.9) +
geom_text(aes(label=Languages), vjust=-0.25, size=3.5, position = position_dodge(0.9)) +
scale_fill_brewer(palette="Set1") +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal() +
theme(legend.position = "none")
## Selecting by count
English (EN) is clearly the most popular language, followed by Chinese (ZH). Since it’s not possible to see the differences between the other languages at this scale, let's create the same plot without English and Chinese.
appstoreGamesDF %>%
select(ID,AgeRating, Languages) %>%
separate_rows(Languages, sep=",", convert = TRUE) %>%
mutate(Languages = trimws(Languages)) %>%
filter(Languages != "EN" & Languages != "ZH") %>%
drop_na(Languages) %>%
group_by(AgeRating,Languages) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(n=7) %>%
ggplot(aes(x=AgeRating, y=count, fill=Languages)) +
geom_col(position = position_dodge(), width=0.9) +
geom_text(aes(label=Languages), vjust=-0.25, size=3.5, position = position_dodge(0.9)) +
scale_fill_brewer(palette="Set1") +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal() +
theme(legend.position = "none")
## Selecting by count
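The split-and-count step behind the language plots above can be tried on a toy data frame. This is only a sketch: the three rows are invented, and `languageCounts` is a name I made up for the example.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for appstoreGamesDF; values are invented for illustration.
toy <- data.frame(
  ID = c(1, 2, 3),
  Languages = c("EN, ZH", "EN", NA),
  stringsAsFactors = FALSE
)

languageCounts <- toy %>%
  separate_rows(Languages, sep = ",") %>%    # one row per (game, language)
  mutate(Languages = trimws(Languages)) %>%  # " ZH" becomes "ZH"
  drop_na(Languages) %>%                     # games without language info
  group_by(Languages) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

languageCounts
```

Without the `trimws()` step, a language listed first in one game and later in another would be counted under two different labels.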
Unfortunately, there is no information regarding the revenue these games make. We can only assume that any user who reviews a non-free game has bought it at least once. Thus, we can build a crude model of how much money a game has made compared to others. Of course, this does not consider games with in-app purchases, which are not only the most common type of game in the App Store but also, according to the news, the games that usually make the most money in mobile gaming.
With this crude model, we can examine how most variables relate to the revenue of a game: e.g., the number of languages, a specific language, the genres, the release date, the age rating, the app size, and maybe others.
Is there a correlation between age rating and the languages available?
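As a rough sketch of the crude model described above, the lower bound on revenue can be computed as price times number of user ratings. The toy rows are invented, `EstimatedRevenue` and `revenueProxy` are hypothetical names, and the result says nothing about in-app purchases.

```r
library(dplyr)

# Toy stand-in for appstoreGamesDF; column names mirror the dataset,
# values are invented for illustration.
toyGames <- data.frame(
  Name            = c("Game A", "Game B", "Game C", "Game D"),
  Price           = c(0, 2.99, 4.99, 0.99),
  UserRatingCount = c(50000, 1200, NA, 300),
  stringsAsFactors = FALSE
)

revenueProxy <- toyGames %>%
  # Free games and games without ratings carry no purchase signal.
  filter(Price > 0, !is.na(UserRatingCount)) %>%
  # Assumption: every rating corresponds to at least one purchase.
  mutate(EstimatedRevenue = Price * UserRatingCount) %>%
  arrange(desc(EstimatedRevenue))

revenueProxy
```

The number is only useful for ranking paid games against each other, since most buyers never leave a rating.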
# appstoreGamesDF %>%
# select(ID,AgeRating, Languages) %>%
# separate_rows(Languages, sep=",") %>%
# drop_na(Languages) %>%
# group_by(ID,AgeRating) %>%
# summarise(numberOfLanguages = n())
# arrange(desc(numberOfLanguages))
appstoreGamesDF %>%
select(ID,AgeRating, Languages) %>%
separate_rows(Languages, sep=",") %>%
drop_na(Languages) %>%
group_by(ID,AgeRating) %>%
summarise(numberOfLanguages = n()) %>%
#ungroup %>%
#group_by(AgeRating) %>%
#summarise(averageNumberOfLanguages = mean(numberOfLanguages)) %>%
ggplot(aes(x=AgeRating, y=numberOfLanguages)) +
geom_boxplot() +
geom_jitter(width = 0.3) +
#geom_text(aes(label=averageNumberOfLanguages), vjust=-0.25, size=3.5) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
coord_cartesian(ylim=c(0,90)) +
theme_minimal()
Is there a correlation between age rating and genre?
appstoreGamesDF %>%
select(ID,AgeRating, Genres) %>%
separate_rows(Genres, sep=",") %>%
drop_na(Genres) %>%
group_by(ID,AgeRating) %>%
summarise(numberOfGenres = n()) %>%
#summarise(averageNumberOfLanguages = mean(numberOfLanguages)) %>%
ggplot(aes(x=AgeRating, y=numberOfGenres)) +
geom_boxplot() +
geom_jitter(width = 0.3) +
#geom_text(aes(label=averageNumberOfLanguages), vjust=-0.25, size=3.5) +
scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
theme_minimal()
To compare the user ratings across age rating categories, I counted the rated games at each average user rating level within each age rating and then calculated the ratio of that count to the total number of rated games in the same age rating. This is displayed in the stacked bar chart below, where the labels show the number of games.
appstoreGamesDF %>%
drop_na(AverageUserRating) %>%
arrange(AverageUserRating) %>%
pull(AverageUserRating) %>%
unique() -> AverageUserRatingLevels #Get a vector containing all possible user rating levels in sequential order.
appstoreGamesDF %>%
select(ID,AgeRating, AverageUserRating) %>%
drop_na(AverageUserRating) %>%
mutate(AverageUserRating = factor(AverageUserRating, levels = AverageUserRatingLevels)) %>%
group_by(AgeRating, AverageUserRating) %>%
summarise(count = n()) %>%
mutate(freq = count / sum(count)) %>%
ggplot(aes(x=reorder(AgeRating,desc(AgeRating)), y=freq, fill=AverageUserRating)) +
geom_col(position = position_stack(reverse = TRUE)) +
scale_fill_brewer(palette = "RdYlGn") +
geom_text(aes(label=count), size=4 ,position=position_stack(vjust = .5, reverse = TRUE)) +
theme_minimal() +
xlab("Age Rating") +
ylab("Proportion of games") +
labs(fill="Average\nUser Rating") +
coord_flip()
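The proportion step in the chunk above relies on how `summarise()` treats grouping: it drops only the last grouping variable, so the following `mutate()` still works per age rating. A small sketch on invented rows (`toyFreq` is a made-up name):

```r
library(dplyr)

# Toy rows; values are invented for illustration only.
toy <- data.frame(
  AgeRating = c("4+", "4+", "4+", "9+", "9+"),
  AverageUserRating = c(4.5, 4.5, 3.0, 5.0, 2.0),
  stringsAsFactors = FALSE
)

toyFreq <- toy %>%
  group_by(AgeRating, AverageUserRating) %>%
  summarise(count = n()) %>%          # result is still grouped by AgeRating
  mutate(freq = count / sum(count))   # so sum(count) is the per-age-rating total

toyFreq
```

Because of this, the `freq` values add up to 1 within each age rating rather than across the whole table.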
Is there a correlation between age rating and user rating count?
Is there a correlation between price and age rating? Between price and the presence of in-app purchases? Between price and user rating count? Between price and language? Between price and genre? Between price and release date? Between price and app size?
Is there a correlation between original release date and current version release date? Between original release date and app size? Between original release date and genre? Between original release date and language?
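One possible way to start on the open questions above is a rank correlation, which suits an ordered factor such as the age rating. This is only a sketch on invented values; choosing Spearman's rho is my assumption, not something settled in this notebook.

```r
# Toy rows; values are invented for illustration only.
toy <- data.frame(
  AgeRating = factor(c("4+", "9+", "12+", "17+", "4+", "12+"),
                     levels = c("4+", "9+", "12+", "17+")),
  UserRatingCount = c(12, 46, 309, 5000, 30, 150)
)

# as.integer() on the factor gives the position of each level (1 = "4+"),
# and Spearman's method only uses the ranks of both variables.
rho <- cor(as.integer(toy$AgeRating), toy$UserRatingCount, method = "spearman")
rho
```

On the real data, rows with missing `UserRatingCount` would need to be dropped first, for example with `use = "complete.obs"`.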